Why most published research findings are false

Johanna Bayer, Maarten Mennes

Replication crisis

“… false findings may be the majority or even the vast majority of published research claims” and “… it can be proven that most claimed research findings are false.”

  • Reason: findings are typically judged on p < 0.05 in a single study
  • But: several factors influence how credible a single significant p value actually is

Variables to take into account


  • the ratio of “true relationships” and “no relationships” of all those tested in a field, \(R\).
    • in the case of 30% true and 70% false relationships in a field, \(R = \frac{0.3}{0.7}\)
  • Prior probability of a true relationship \(\frac{R}{R+1}\)
  • power of the study \(1-\beta\)
  • significance level \(\alpha\)
  • the number of relationships that are probed in the field, \(c\)

Positive predictive value (PPV)

  • The probability that a finding is true given that it has been identified as true in the research process.


    PPV = \(\frac{(1-\beta)R}{(R+\alpha-R\beta)}\)
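This formula is easy to check numerically. A minimal sketch (the function name and the default error rates are mine, not the paper's):

```python
# Positive predictive value of a claimed finding (Ioannidis, 2005):
# PPV = (1 - beta) * R / (R + alpha - beta * R)

def ppv(R, alpha=0.05, beta=0.2):
    """PPV given pre-study odds R, significance level alpha,
    and type II error rate beta (power = 1 - beta)."""
    return (1 - beta) * R / (R + alpha - beta * R)

# Field with 30% true relationships (R = 0.3/0.7), alpha = 0.05, 80% power:
print(round(ppv(0.3 / 0.7), 3))
```

Even under these fairly optimistic assumptions the PPV stays well below 1, and with low pre-study odds (small \(R\)) it drops below 0.5, i.e., a claimed finding is more likely false than true.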

Consequences from the table

  • a research finding is more likely to be true than false if \((1-\beta)R > \alpha\)

  • when either the power or the prior probability is very low, the chance that a finding is due to a type I error increases!

Bias

  • let \(u\) be the proportion of analyses that would not have been “research findings” but end up being presented and reported as such, because of bias.

  • \(u\) distorts \(\alpha\) and \(\beta\)

  • PPV decreases with increasing \(u\) unless \(1-\beta \le \alpha\)

  • Conversely, true research findings might be annulled due to reverse bias.
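The effect of bias can be made concrete with the bias-adjusted PPV from the paper; a minimal sketch (function name and defaults are mine):

```python
def ppv_bias(R, u, alpha=0.05, beta=0.2):
    """PPV in the presence of bias u, the proportion of analyses
    that would not have been findings but are reported as such.
    Formula from Ioannidis (2005)."""
    num = (1 - beta) * R + u * beta * R
    den = R + alpha - beta * R + u - u * alpha + u * beta * R
    return num / den

# PPV erodes as bias grows, here for R = 0.3/0.7 and 80% power:
for u in (0.0, 0.2, 0.5, 0.8):
    print(u, round(ppv_bias(0.3 / 0.7, u), 3))
```

With \(u = 0\) this reduces to the plain PPV formula; as \(u\) grows, the reported literature increasingly reflects bias rather than true effects.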

Testing by different teams

  • with an increasing number of independent tests, the PPV tends to decrease unless \(1-\beta < \alpha\)

PPV = \(\frac{R(1-\beta^n)}{R + 1 - (1-\alpha)^n - R\beta^n}\)
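This formula can also be evaluated directly; a minimal sketch (function name and defaults are mine) showing how PPV decays as more independent teams probe the same relationship:

```python
def ppv_teams(R, n, alpha=0.05, beta=0.2):
    """PPV when n independent teams test the same relationship and
    any single positive result counts as a 'finding' (Ioannidis, 2005)."""
    num = R * (1 - beta ** n)
    den = R + 1 - (1 - alpha) ** n - R * beta ** n
    return num / den

# R = 0.3/0.7, 80% power: PPV shrinks as the number of teams grows.
for n in (1, 5, 10):
    print(n, round(ppv_teams(0.3 / 0.7, n), 3))
```

With \(n = 1\) this matches the single-study PPV; by \(n = 10\) the cumulative chance of at least one false positive has eroded much of the predictive value.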

Corollaries derived from the table

Corollary 1: The smaller the studies conducted in a scientific field, the less likely the research findings are to be true.

  • small sample sizes mean less power
  • research findings are more likely to be true in fields with large studies (RCTs) than in fields with many small studies
  • Relationship between sample size (n), effect size (ES), and power (\(Z_{1-\beta}\))


\(n = 2\left(\frac{Z_{1-\alpha/2} + Z_{1-\beta}}{ES}\right)^2\)
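The formula above can be sketched with the standard-normal quantile function from the stdlib (Python ≥ 3.8); the function name is mine, and this is the usual two-sample, two-sided approximation:

```python
from statistics import NormalDist

def n_per_group(es, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sample comparison
    with standardized effect size es: n = 2 * ((z_{1-a/2} + z_{1-b}) / ES)^2."""
    z = NormalDist().inv_cdf  # standard-normal quantile function
    return 2 * ((z(1 - alpha / 2) + z(power)) / es) ** 2

# A 'medium' effect (ES = 0.5) at alpha = 0.05 and 80% power:
print(round(n_per_group(0.5)))
```

Halving the effect size roughly quadruples the required sample, which is why fields chasing small effects with small samples end up badly underpowered.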

Corollary 2: The smaller the effect sizes in a scientific field, the less likely the research findings are to be true.

  • small effect sizes require more power (larger samples) to detect the effect reliably
  • median power in neuroscience studies estimated to be between 8 and 31% (Button et al., 2013)

Decreased PPV with the number of tests

  • Corollary 3: The greater the number and the lesser the selection of tested relationships in a scientific field, the less likely the research findings are to be true
  • Corollary 6: The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true.
    • the greater the number of tested relationships, the lower the pre-study probability that any one of them is true.

Impact of Bias

  • Corollary 4: The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true.
  • Corollary 5: The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true.
  • Flexibility increases the potential for transforming what would be “negative” results into “positive” results, i.e., bias, u.

Interplay between various variables and PPV

  • claimed research findings might sometimes only reflect the amount of bias in the field.

Double dipping (Kriegeskorte et al., 2009)

  • One type of bias is “double dipping” - using the same set of data for selection and selective analysis.
  • Example: hypothesize that a region responds more strongly to stimulus A than B, select voxels showing this effect to define an ROI, and then selectively analyze that same ROI to test the hypothesis.
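The inflation from double dipping shows up even in pure noise. A toy simulation (hypothetical data, not the paper's analysis): select the "most responsive" voxels and estimate the effect either in the same data (circular) or in a held-out half (independent):

```python
import random

random.seed(0)

# 1000 'voxels' x 20 samples of pure noise: there is no real A-vs-B effect.
n_vox, n_samp = 1000, 20
data = [[random.gauss(0, 1) for _ in range(n_samp)] for _ in range(n_vox)]

def mean(xs):
    return sum(xs) / len(xs)

# Circular: pick the 10 voxels with the largest mean on THIS data,
# then 'measure' the effect in those same voxels.
top = sorted(data, key=mean, reverse=True)[:10]
circular_effect = mean([mean(v) for v in top])

# Independent: select on the first half of each voxel's samples,
# estimate the effect on the held-out second half.
top_idx = sorted(range(n_vox), key=lambda i: mean(data[i][:10]),
                 reverse=True)[:10]
independent_effect = mean([mean(data[i][10:]) for i in top_idx])

print(circular_effect, independent_effect)
```

The circular estimate is strongly positive despite the data being noise, while the held-out estimate hovers near zero, which is exactly why selection and estimation need independent data.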

Kriegeskorte et al., 2009

  • surveyed 134 functional MRI (fMRI) articles published in 2008
  • 42% contained circular analyses, with the analyses of an additional 14% of papers unclear


Too many tests

How can we improve the situation?

  • larger studies and more power
  • development of research standards
  • improve understanding of \(R\) values and pre-study odds
  • use of independent data sets for exploration and confirmation
  • ?